Seven Steps to Improve the Transparency of Statistical Practices | Nature Human Behaviour



Nature Human Behaviour 5, 1473–1480 (2021)

We argue that statistical practice in the social and behavioral sciences benefits from transparency, a fair acknowledgement of uncertainty, and openness to alternative interpretations. To promote this approach, we recommend seven concrete statistical procedures: (1) visualizing data; (2) quantifying inferential uncertainty; (3) assessing data preprocessing choices; (4) reporting multiple models; (5) involving multiple analysts; (6) interpreting results modestly; (7) sharing data and code. We discuss the benefits and limitations of each procedure and provide guidelines for adoption. Each of the seven procedures finds inspiration in Merton's ethos of science, as embodied in the norms of communalism, universalism, disinterestedness, and organized skepticism. We believe that these ethical considerations, and their statistical consequences, establish common ground among data analysts, despite continuing disagreements about the foundations of statistical inference.

A cursory reading of the published literature suggests that statisticians rarely agree on anything. Different schools of thought, principally the frequentist, likelihoodist, and Bayesian schools, have been at odds for decades: they dispute the meaning of "probability", the role of prior knowledge, and the relative value of objective versus subjective analysis, and they disagree about the main goal of inference itself, namely whether researchers should control error rates, update beliefs, or make coherent decisions. Fundamental disagreements exist not only between the schools but also within them. For example, among frequentists there is a long-standing dispute between those who test hypotheses with P values and those who emphasize estimation by confidence intervals; among Bayesians, Jack Good's claim that there are 46,656 possible varieties of Bayesians1 has proven to be an underestimate (but see ref. 2).

This divergence also shows up in practice when multiple statisticians and statistical practitioners independently analyze the same data set. Specifically, recent "multi-analyst" projects show that statisticians rarely use the same analysis and often reach different conclusions, even for exactly the same data set and research question3,4,5,6,7. Conflicting guidelines regarding P values also reveal deep-seated differences8,9,10,11,12,13. Should practitioners avoid the term "statistically significant"? Should they lower the P-value threshold, justify it, or abandon P values altogether? And if P values are abandoned, what should replace them? Because statisticians argue over these basic issues, users of applied statistics may be forgiven for adopting a wait-and-see attitude and carrying on with business as usual.

Against this backdrop, we argue that, beyond the many contentious and unresolved debates, statisticians can agree on a set of scientific norms. We foreground these norms because we believe they are highly relevant to statistical practice in the social and behavioral sciences. The norms that we believe should guide statistical practice are communalism, universalism, disinterestedness, and organized skepticism: the four scientific norms proposed by Merton14 (originally published in 1942; see Box 1 for an overview of Merton's norms and Box 2 for an overview of how each of the statistical procedures discussed here satisfies them).

Merton14 proposed that the ethos of science is characterized by the following four norms:

Communalism. "The substantive findings of science are a product of social collaboration and are assigned to the community. [...] Property rights in science are whittled down to a bare minimum by the rationale of the scientific ethic. [...] The institutional conception of science as part of the public domain is linked with the imperative for communication of findings. Secrecy is the antithesis of this norm; full and open communication its enactment." (ref. 14, pp. 273–274).

Universalism. "[T]ruth-claims, whatever their source, are to be subjected to preestablished impersonal criteria: consonant with observation and with previously confirmed knowledge. The acceptance or rejection of claims entering the lists of science is not to depend on the personal or social attributes of their protagonist; his race, nationality, religion, class, and personal qualities are as such irrelevant." (ref. 14, p. 270).

Disinterestedness. "Science, as is the case with the professions in general, includes disinterestedness as a basic institutional element. [...] A passion for knowledge, idle curiosity, altruistic concern with the benefit to humanity [...] have been imputed to the scientist." (ref. 14, pp. 275–276).

Organized skepticism. This "involves a latent questioning of certain bases of established routine, authority, vested procedures, and the realm of the 'sacred' generally. [...] Science which asks questions of fact concerning every phase of nature and society may come into conflict with other attitudes toward these same data, which have been crystallized and often ritualized by other institutions. Most institutions demand unqualified faith; but the institution of science makes skepticism a virtue." (ref. 14, pp. 264–265).

Generally speaking, when the Mertonian norms are applied to statistics, the overarching themes are the need for transparency, the acknowledgement of uncertainty, and openness to alternative interpretations. Thus, although the Mertonian norms were proposed more than half a century ago, they reflect the current drive to improve scientific transparency and reproducibility. Crucially, the principles behind the Mertonian norms can be translated into concrete statistical practices. A non-exhaustive list of such practices includes (1) visualizing data; (2) quantifying inferential uncertainty; (3) assessing data preprocessing choices; (4) reporting multiple models; (5) involving multiple analysts; (6) interpreting results modestly; (7) sharing data and code. We believe that, allowing for reasonable exceptions (for example, privacy concerns or strict limits on time and money), most statisticians will generally endorse these practices85. Below, we discuss each practice in more detail, including its benefits, limitations, and guidelines.

By visualizing the data, researchers can graphically represent the key aspects of the observed data and the important attributes of the applied statistical model.

Data visualization is important at all stages of the statistical workflow. In exploratory data analysis, visualization helps researchers formulate new theories and hypotheses15. In model evaluation, visualization supports the detection of model misfit and guides the development of appropriate statistical models16,17,18,19,20. Finally, once the analysis is complete, visualizations of the data and the model fit are arguably the most effective way to communicate the main findings to a scientific audience21.

For an example of how data visualization facilitates the development of new hypotheses, consider the famous map of cholera deaths drawn by the London anesthesiologist John Snow during the cholera outbreak in Soho, London, in September 1854. When the epidemic broke out, John Snow created a dot map showing the homes of the deceased and the nearby water pumps (Figure 1). The mapped data showed that the deaths clustered around a particular pump on Broad Street, suggesting that the disease was transmitted through water rather than through the air22. At Snow's request, the handle of the pump was removed, disabling it, and the local epidemic soon ended. It was later discovered that the well feeding the pump had been contaminated with sewage, which had caused the nearby outbreak.

Dots represent the homes of the deceased; crosses represent water pumps. The contaminated pump responsible for the cholera outbreak is located on Broad Street. Reprinted with permission from ref. 22.

For an example of how data visualization reveals model misspecification, consider Anscombe's quartet23, shown in Figure 2. All four scatter plots have the same summary statistics (that is, means, standard deviations, and Pearson correlation coefficients). Visual inspection of the panels makes it clear, however, that the bivariate relationship in each panel is fundamentally different24.

Although the four data sets are equivalent in terms of their summary statistics, the Pearson correlation is an appropriate summary only for the data set in the top-left panel. The figure is available at https://www.shinyapps.org/apps/RGraphCompendium (CC BY).
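To make the contrast concrete, the minimal Python sketch below (our illustration, not part of the original article; it assumes that pandas, seaborn, and matplotlib are installed and uses the copy of Anscombe's quartet that ships with seaborn as an example dataset) prints the near-identical summary statistics of the four panels and then draws the scatter plots that reveal their very different structure.

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Anscombe's quartet is bundled with seaborn as an example dataset
# (columns: 'dataset', 'x', 'y'; 'dataset' takes the values I-IV).
df = sns.load_dataset("anscombe")

# Near-identical summary statistics for all four panels
summary = df.groupby("dataset").agg(
    mean_x=("x", "mean"), sd_x=("x", "std"),
    mean_y=("y", "mean"), sd_y=("y", "std"),
)
summary["pearson_r"] = df.groupby("dataset").apply(lambda g: g["x"].corr(g["y"]))
print(summary.round(2))

# The scatter plots, by contrast, reveal four very different relationships
sns.lmplot(data=df, x="x", y="y", col="dataset", col_wrap=2, ci=None, height=3)
plt.show()
```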

Ever since William Playfair (1759–1823) invented the first statistical graphs (such as the line graph and the bar chart25), data visualization has been an important part of science. Today, graphing is part of most statistical software packages and has become an indispensable tool for carrying out certain analyses (for example, principal component analysis or prior and posterior predictive checks) and for handling large data sets (for example, through cluster analysis26). Technology now allows us to go beyond static visualization and display dynamic aspects of the data, for example with the software packages R Shiny27 or iNZight28.

Despite its obvious benefits, data visualization also provides opportunities to mislead, for example when axes are scaled to mask real differences or stretched to exaggerate trivial ones, thereby suggesting spurious patterns29,30,31.

In addition, the amount of information in a graph usually depends on researchers' design skills and on how carefully they consider what information should be conveyed. Scientists without programming experience often find themselves limited to the options offered by standard graphing software. However, the example of Anscombe's quartet shows that even the simplest plot can provide a wealth of information.

There are no uniform guidelines on when to use graphical representations, or which ones to use. However, according to Tufte (ref. 32, page 92), a good statistical graphic follows one basic principle: "above all else show the data" (that is, minimize non-data elements). In general, scientists should strive to create graphs that are as clear, informative, and complete as possible. These qualities are also emphasized in the ASA ethical guidelines, which state that, to ensure the integrity of data and methods, the ethical statistician "[i]n publications and reports, conveys the findings in ways that are both honest and meaningful to the user/reader. This includes tables, models, and graphics" (ref. 33, page 3).

In addition, guidelines depend on various aspects of the data (for example, the complexity of the data and the experimental design) and of the context (compare ref. 34); here, we refer interested readers to the many handbooks that describe good practice for the graphical representation of statistics32,35,36,37,38,39,40.

By reporting the precision of model parameter estimates, analysts communicate the inevitable uncertainty that comes with inference from limited samples.

Only by assessing and reporting inferential uncertainty is it possible to say anything about the extent to which the sample results generalize to the population. For example, Strack et al.41 studied whether participants found cartoons funnier when they held a pen between their teeth (which induces a smile) rather than between their lips (which induces a pout). On a 10-point Likert scale, the authors observed an effect of 0.82 units in the original study. To interpret this result, the uncertainty of the associated inference must be known. In this case, the 95% confidence interval ranges from −0.05 to 1.69, indicating that the data are consistent with a wide range of effect-size estimates, including negligible and even negative effects.
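As a minimal illustration of reporting an effect together with its uncertainty, the Python sketch below (our own example with simulated ratings, not the data of Strack et al.41) computes a mean difference between two conditions and its 95% confidence interval using a Welch-type standard error.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2021)

# Hypothetical funniness ratings on a 10-point scale for two conditions;
# these are simulated numbers, not the original data.
teeth = rng.normal(loc=5.2, scale=2.0, size=45)  # pen held between the teeth
lips = rng.normal(loc=4.4, scale=2.0, size=47)   # pen held between the lips

diff = teeth.mean() - lips.mean()

# Welch standard error and degrees of freedom for the mean difference
v_t = teeth.var(ddof=1) / teeth.size
v_l = lips.var(ddof=1) / lips.size
se = np.sqrt(v_t + v_l)
dof = (v_t + v_l) ** 2 / (v_t ** 2 / (teeth.size - 1) + v_l ** 2 / (lips.size - 1))

t_crit = stats.t.ppf(0.975, dof)
ci_low, ci_high = diff - t_crit * se, diff + t_crit * se
print(f"mean difference = {diff:.2f}, 95% CI [{ci_low:.2f}, {ci_high:.2f}]")
```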

In almost all statistics courses, students are taught to provide not only a summary of the statistical tests (such as F, t, and P values and the associated degrees of freedom) but also parameter point estimates (for example, regression weights or effect sizes) together with the associated inferential uncertainty (for example, standard errors, posterior distributions, confidence intervals, or credible intervals). However, there is a gap between what is taught and what is practiced. Surveys of published articles in physiology20, the social sciences42, and medicine43,44 show that error bars, standard errors, or confidence intervals are not always reported. Popular indices such as Cronbach's alpha (a measure of the reliability of test scores) are rarely accompanied by a measure of inferential uncertainty.
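Uncertainty can be attached even to indices for which no interval is routinely reported. The sketch below (hypothetical item scores; a nonparametric bootstrap is one reasonable choice among several) reports Cronbach's alpha together with a 95% bootstrap confidence interval.

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """items: respondents x items matrix of scores."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars / total_var)

rng = np.random.default_rng(1)
# Hypothetical questionnaire: 200 respondents, 6 correlated items
latent = rng.normal(size=(200, 1))
scores = latent + rng.normal(scale=1.0, size=(200, 6))

alpha = cronbach_alpha(scores)

# Nonparametric bootstrap over respondents
boot = [cronbach_alpha(scores[rng.integers(0, len(scores), len(scores))])
        for _ in range(2000)]
ci_low, ci_high = np.percentile(boot, [2.5, 97.5])
print(f"alpha = {alpha:.2f}, 95% bootstrap CI [{ci_low:.2f}, {ci_high:.2f}]")
```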

We believe there is no acceptable reason for omitting inferential uncertainty from any report.

Although it is not a limitation as such, note that inferential uncertainty always has to be quantified relative to the inferential goal: does the researcher want to generalize across people, stimuli, time points, or some other dimension? The correct way of computing a standard error depends on the researcher's purpose.

Various guidelines strongly recommend that effect-size estimates be accompanied by a measure of uncertainty in the form of standard errors or confidence intervals. For example, the publication manual of the American Psychological Association (6th edition, page 34) states: "When point estimates (e.g., sample means or regression coefficients) are provided, always include an associated measure of variability (precision), with an indication of the specific measure used (e.g., the standard error)." In addition, the International Committee of Medical Journal Editors45 explicitly recommends: "Where possible, quantify findings and present them with appropriate indicators of measurement error or uncertainty (such as confidence intervals)" (ref. 45, page 17).

This box outlines how each of the seven procedures discussed in this Perspective satisfies the Mertonian norms. The following paragraphs give an overview.

A well-designed visualization shows the key aspects of the data at a glance. Moreover, by giving readers a more complete picture of the data and the associated statistics, a visualization can support or undercut the conclusions drawn by the researchers, or help readers find alternative ways of interpreting the results and analyzing the data.

Acknowledging inferential uncertainty (for instance, by providing standard errors or confidence intervals) facilitates open communication. In addition, quantifying inferential uncertainty signals that researchers openly acknowledge the degree to which their measurements are imprecise, particularly when sample sizes are small. Finally, the explicit acknowledgement of inferential uncertainty may prompt readers to question the extent to which the sample results generalize to the population.

When researchers share only the results from a single data preprocessing pipeline, they may inadvertently hide important information. If the result turns out to be sensitive to particular preprocessing choices, this warrants suspicion and may trigger a debate about the importance and plausibility of the relevant data preprocessing options (compare ref. 50, page 308).

As in the previous entry, reporting the results of only a single model may inadvertently hide important information.

The multi-analyst approach can reveal whether different analysis teams draw convergent or divergent conclusions from the same data set. By including other analysts with different backgrounds and interests, the potential influence of any individual analyst's self-interest is offset. The multi-analyst approach also fosters organized skepticism by revealing alternative statistical views of the data.

Arguably, disinterested analysts feel little need to overstate claims, impress critics, or downplay signs of model misfit. Analysts who contribute to organized skepticism do not try to suppress doubt: they are not defensive, and they do not seek to shield their work from the good-faith scrutiny of their peers.

Keeping data confidential restricts the accumulation of knowledge and violates the ethos of science. All interested researchers should have access to relevant, properly anonymized data. Importantly, sharing data allows skeptical eyes to scrutinize the results and facilitates quality control.

By assessing the impact of plausible alternative data preprocessing choices (that is, by examining the "data multiverse"46), analysts determine the fragility or robustness of the results under review.

A data multiverse analysis reveals the fragility or robustness of a finding under reasonable alternative data preprocessing choices. This prevents researchers from falling victim to hindsight bias and motivated reasoning, which might otherwise lead them, unknowingly, to report only the preprocessing pipeline that yields the most compelling result47,48. But even a perfectly unbiased analysis benefits from a data multiverse analysis, because it reveals uncertainty that would otherwise remain hidden.

For example, Steegen et al.46 revisited the results of Durante et al.49, who had reported an interaction between relationship status (that is, single or not) and fertility (that is, high versus low, derived from the menstrual cycle) on reported religiosity. After applying a set of 180 different data preprocessing pipelines (including, for instance, five different ways of classifying women as high or low fertility), the multiverse reanalysis showed that the 180 resulting P values were spread roughly uniformly between 0 and 1, indicating that the reported interaction is highly fragile.
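A data multiverse analysis can be set up as a simple loop over preprocessing choices. The sketch below is our simplified illustration (hypothetical variable names and cut-offs, a plain two-group comparison rather than the interaction model, and nothing like the 180 pipelines of Steegen et al.46): it crosses a few defensible dichotomization and exclusion rules and records the P value that each combination yields.

```python
from itertools import product

import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(46)
# Hypothetical data set: cycle information and a religiosity outcome
df = pd.DataFrame({
    "cycle_day": rng.integers(1, 29, size=300),
    "reported_cycle_length": rng.integers(24, 35, size=300),
    "religiosity": rng.normal(size=300),
})

# Alternative, arguably equally defensible preprocessing choices
fertility_rules = {
    "narrow_window": lambda d: d["cycle_day"].between(7, 14),
    "broad_window": lambda d: d["cycle_day"].between(6, 17),
}
exclusion_rules = {
    "keep_all": lambda d: d,
    "typical_cycles_only": lambda d: d[d["reported_cycle_length"].between(25, 32)],
}

results = []
for (f_name, f_rule), (e_name, e_rule) in product(fertility_rules.items(),
                                                  exclusion_rules.items()):
    data = e_rule(df).copy()
    data["high_fertility"] = f_rule(data).astype(int)
    high = data.loc[data["high_fertility"] == 1, "religiosity"]
    low = data.loc[data["high_fertility"] == 0, "religiosity"]
    p = stats.ttest_ind(high, low, equal_var=False).pvalue
    results.append({"fertility": f_name, "exclusion": e_name, "p": p})

print(pd.DataFrame(results))  # one row per preprocessing pipeline
```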

The idea of assessing sensitivity to data preprocessing choices goes back at least to De Groot (ref. 47, page 190) and Leamer (ref. 50, page 308), and was reintroduced by Simmons et al.48 and Steegen et al.46. In functional magnetic resonance imaging, Carp51 and Poldrack et al.52 both highlighted the hidden impact of different, seemingly plausible preprocessing pipelines. In psychology, recent applications include Bastiaansen et al.3 and Wessel et al.53. Nevertheless, most empirical articles do not report the results of a data multiverse analysis.

One practical limitation of the data multiverse is the additional work it requires. Another limitation lies in the ambiguity surrounding the definition of the data multiverse: the analyst must decide what constitutes a sufficiently representative set of preprocessing choices, and whether all choices are equally plausible and should therefore receive equal weight in the multiverse analysis. A final limitation is that it is not always clear how to interpret the outcome of a data multiverse analysis. Interpretation can be facilitated by graphical formats that cluster related pipelines (for example, specification curves54).

Simmons et al.48 provide some concrete guidance for assessing data preprocessing choices (see their requirements for authors, items 5 and 6), but general guidelines are difficult to give because "[...] a multiverse analysis is highly contextual and subjective in nature. Listing the alternative options for data construction requires a judgment of which options can be considered reasonable, and typically depends on the experimental design, the research question, and the researcher carrying out the study" (ref. 46, page 709). More general guidance on reporting preprocessing choices is given in the ASA ethical guidelines33, which mention that, to ensure the integrity of data and methods, the ethical statistician "[a]cknowledges data editing procedures, including any imputation and missing data mechanisms, when reporting on the validity of the data used" (ref. 33, page 2).

By assessing the impact of plausible alternative statistical models (that is, by examining the model multiverse), analysts gauge the fragility or robustness of their statistical conclusions.

Analogous to the data multiverse analysis discussed in the previous section, a model multiverse analysis examines the fragility or robustness of a finding under reasonable alternative statistical modeling choices. Modeling choices include differences in estimators and fitting routines, but also differences in model specification and variable selection. Reporting the results of multiple plausible models reveals uncertainty that would remain hidden if only a single model were entertained. In addition, this approach protects analysts against hindsight bias and motivated reasoning, which might otherwise lead them, unknowingly, to select the single model that yields the most pleasing conclusion. For example, Patel et al.55 quantified the variability of results across model specifications. They treated 13 clinical, environmental, and physiological variables as potential covariates of the association between each of 417 self-reported clinical and molecular phenotypes and all-cause mortality. They thus computed P values for 2^13 = 8,192 models per association and examined the instability of the resulting inferences, which they termed the "vibration of effects".
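In the same spirit, the sketch below (our illustration with hypothetical variables, using statsmodels OLS as one possible estimator; it is not the specification set of Patel et al.55) enumerates every subset of a small pool of candidate covariates and records how the coefficient of interest "vibrates" across model specifications.

```python
from itertools import chain, combinations

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(55)
n = 500
# Hypothetical data: an outcome, an exposure of interest, and four candidate covariates
df = pd.DataFrame(rng.normal(size=(n, 6)),
                  columns=["outcome", "exposure", "age", "bmi", "income", "activity"])
df["outcome"] += 0.3 * df["exposure"] + 0.2 * df["age"]

candidates = ["age", "bmi", "income", "activity"]
subsets = chain.from_iterable(combinations(candidates, k)
                              for k in range(len(candidates) + 1))

records = []
for subset in subsets:  # 2**4 = 16 model specifications
    formula = "outcome ~ exposure" + "".join(f" + {c}" for c in subset)
    fit = smf.ols(formula, data=df).fit()
    records.append({"model": formula,
                    "beta_exposure": fit.params["exposure"],
                    "p_exposure": fit.pvalues["exposure"]})

multiverse = pd.DataFrame(records)
print(multiverse.sort_values("beta_exposure"))
```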

Although the idea of a model multiverse goes back at least to De Groot47 and Leamer50, most empirical researchers still draw conclusions from a single analysis (but see refs 56,57).

As with the construction of a data multiverse, a practical limitation of the model multiverse is that it demands additional work from analysts and readers. Recent work has shown that the number of plausible models can be very large4,7. Multiverses also differ in how informative they are, and readers need to assess whether a multiverse contains genuinely different models or merely the same model run many times. The model space can be overwhelming, and any individual analyst will naturally gravitate toward the subset of models they are familiar with (or, unknowingly, toward the subset of models that produces the most pleasing results, or the results most consistent with prior expectations). In addition, Del Giudice and Gangestad (ref. 58, page 5) argue that "by expanding the size of the analysis space, unprincipled combinatorial explosion may, ironically, inflate the perceived detail and authority of a multiverse while drastically reducing its informative portion. At the same time, the size of the specification space makes it harder to check the results for potentially relevant effects. If left unchecked, multiverse-style analyses may produce analytical 'black holes': massive analyses that swallow effects of genuine interest but, because of their perceived exhaustiveness and sheer scale, command whatever attention exists through displays and abstracts that are difficult to digest."

Because the construction of a model multiverse depends on the knowledge and expertise of the analyst, general guidance is hard to give. For relatively simple regression models, however, clear guidelines do exist55,59. In addition, Simonsohn et al.54 recommend specification curve analysis, and Dragicevic et al.60 recommend presenting the results interactively. The ASA ethical guidelines33 mention that, in fulfilling responsibilities to funders and clients, the ethical statistician "[t]o the extent possible, presents a client or employer with choices among valid alternative statistical approaches that may vary in scope, cost, or precision" (ref. 33, page 3). The ASA does not mention that researchers have the same responsibility toward their scientific peers, although this may be implicit.

A general recommendation for constructing a comprehensive model multiverse is to collaborate with statisticians who have complementary expertise, which brings us to the next practice.

By having multiple analysts independently analyze the same data set, researchers can reduce the impact of analyst-specific choices in data preprocessing and statistical modeling.

The multi-analyst approach reveals the uncertainty that arises from the idiosyncratic choices of a single analyst and promotes the application of a wider range of statistical techniques. When the analysts' conclusions converge, confidence in the robustness of the finding increases; when the conclusions diverge, confidence is tempered and careful study of the statistical reasons for the lack of consensus is encouraged.

For example, Silberzahn et al.7 used a multi-analyst approach in which 29 analysis teams used the same data set to examine whether the skin tone of football players affects their likelihood of receiving a red card. Although most analysis teams reported that darker-skinned players had a higher probability of receiving a red card, some teams reported null results. The teams' analyses differed greatly in data preprocessing and statistical modeling (for instance, in the covariates included, the link functions used, and the assumed hierarchical structure).
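When the per-team results come back, a compact summary of the point estimates and intervals makes convergence or divergence easy to see. The sketch below is a minimal, hypothetical illustration (made-up team estimates, not the results of Silberzahn et al.7) of such a forest-style summary in Python.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical per-team odds-ratio estimates with 95% CIs (illustrative numbers only)
teams = pd.DataFrame({
    "team": ["A", "B", "C", "D", "E"],
    "odds_ratio": [1.31, 1.28, 1.12, 1.40, 0.96],
    "ci_low": [1.05, 1.03, 0.88, 1.09, 0.71],
    "ci_high": [1.63, 1.59, 1.43, 1.80, 1.30],
}).sort_values("odds_ratio").reset_index(drop=True)

# Forest-style summary: one point and interval per analysis team
fig, ax = plt.subplots(figsize=(5, 3))
y = range(len(teams))
ax.errorbar(teams["odds_ratio"], list(y),
            xerr=[teams["odds_ratio"] - teams["ci_low"],
                  teams["ci_high"] - teams["odds_ratio"]],
            fmt="o", capsize=3)
ax.axvline(1.0, linestyle="--", linewidth=1)  # line of no effect
ax.set_yticks(list(y))
ax.set_yticklabels(teams["team"])
ax.set_xlabel("Estimated odds ratio")
ax.set_ylabel("Analysis team")
fig.tight_layout()
plt.show()
```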

A forerunner of the multi-analyst approach is the 1857 "cuneiform challenge", in which four scholars independently translated a previously unseen ancient Assyrian inscription61. The overlap between their translations, which were sent to the Royal Asiatic Society in sealed envelopes and then opened and examined by a separate review committee, was striking, and it removed any doubt about the methods used to decipher such inscriptions. Although recent examples exist in psychology and neuroscience3,4,5,7,62,63, the multi-analyst approach has never become common practice.

As with constructing a data multiverse or a model multiverse, a practical limitation of the multi-analyst approach lies in the additional work it requires, particularly in (1) finding knowledgeable analysts interested in participating; (2) documenting the data set, describing the research question, and specifying the goal of the statistical inference; and (3) organizing the teams' initial responses and possibly coordinating a round of review and feedback. Although differences of opinion should be respected, there also needs to be a way to filter out analyses that contain outright errors. Another limitation concerns the possible homogeneity of the analysts: for example, all participating analysts may have been trained in the same tradition, share cultural or social biases, or simply make the same mistakes. In such cases the results may produce an exaggerated sense of certainty in the conclusions drawn. These potential limitations can be mitigated by selecting a diverse group of analysts and by incorporating rounds of feedback and revision7, round-table discussions5, or a more systematic use of the Delphi method64.

There are no established guidelines for the multi-analyst approach. We suggest that the optimal number of analysts depends on the complexity of the data, on the importance of the research question (for example, a clinical trial of the effectiveness of a new drug against coronavirus disease 2019 (COVID-19) warrants a relatively large number of analysts), and on the probability that analysts may reasonably reach different conclusions (for example, when there are multiple ways of operationalizing the research question, or multiple dependent variables and predictors that may or may not be relevant).

When selecting analysts, care should be taken to ensure heterogeneity, diversity, and balance. Specifically, attention should be paid to the potentially biasing influence of analysts' specific background knowledge, culture, education, and career stage.

The ASA guidelines underscore the legitimacy and value of alternative analyses, stating that "[t]he practice of statistics requires consideration of the entire range of possible explanations for observed phenomena, and distinct observers [...] may arrive at different and potentially diverging judgments about the plausibility of different explanations" (ref. 33, page 5).

By interpreting results modestly, analysts explicitly acknowledge the doubts that remain about the importance, reproducibility, and generality of the scientific claims at hand.

Modestly presented scientific claims allow readers to evaluate the results for what they usually are: not final, but tentative findings pointing in a certain direction, with considerable uncertainty surrounding their generality and scope. Overselling results may lead to the misallocation of public resources to approaches that have not actually been properly validated and are not yet ready for application in practice. Moreover, researchers themselves may lose long-term credibility in exchange for the short-term benefits of extra attention and additional citations. In addition, after publicly committing to a bold claim, it is difficult to admit that one's initial assessment was wrong; in other words, overconfidence stands in the way of scientific learning.

Even in moments of great triumph, truly humble scientists remain skeptical. For example, when James Chadwick found experimental evidence for the neutron, a discovery that earned him the Nobel Prize, he modestly communicated it under the title "Possible existence of a neutron"65.

Tukey66 warned long ago: "Leaving aside unethical practices, one of the most dangerous [...] practices of data analysis [...] is the use of formal data-analytical procedures for sanctification, for the preservation of conclusions from all criticism, for the granting of an imprimatur" (ref. 66, page 13). Almost 60 years later, an editorial in Nature Human Behaviour warned its readers against "conclusive narratives that leave no room for ambiguity or for conflicting or inconclusive results" (ref. 67, page 1). Similarly, Simons et al.68 proposed that a mandatory "constraints on generality" statement be added to the discussion section of all primary research articles in psychology, to prevent authors from overstating the generality of their findings. This suggests that scientific modesty is practiced less often than one would expect if the Mertonian norms were widely adopted. Indeed, there are telling signs of a lack of modesty. First, the frequency of superlatives (words such as "amazing", "groundbreaking", and "unprecedented") appears to have increased over the past decades69. Second, the dichotomization of findings (that is, ignoring the uncertainty inherent in statistical inference) is common practice (ref. 42; see also section 4.3). Third, textbooks on how to write articles (usually a reflection of current practice) often explicitly encourage authors to oversell70,71.

Publications and funding are crucial for survival in science. Combined with the fact that journals and funders tend to favor groundbreaking and clear-cut results, a modest interpretation of one's results may therefore be detrimental to one's career. Encouraging this Mertonian practice may require institutional change, although some argue that scientists should not hide behind the system to justify their behavior72.

We can contribute to greater intellectual humility in several ways. First, when acting as reviewers of papers and grant proposals, we can encourage intellectual humility in the work of others73. Because reviewers' own careers do not depend on the papers they review, they are free to make their endorsement conditional on a more modest presentation of the results. Hoekstra and Vazire73 propose a series of recommendations for increasing modesty in each traditional section of an empirical article, and authors can adopt these recommendations as well. One example: "Titles should not state or imply stronger claims than are warranted (e.g., causal claims without strong evidence for causality)" (ref. 73, page 16).

The ASA guidelines similarly state that the ethical statistician "[i]s candid about any known or suspected limitations, defects, or biases in the data that may affect the integrity or reliability of the statistical analysis" (ref. 33, page 2).

By sharing data and analysis code, researchers provide the basis for their scientific claims. Ideally, data and code are open, free, and shared in a way that facilitates reuse.

Because there are many ways to process and analyze data7,46, sharing code facilitates reproducibility and encourages sensitivity analyses. Sharing data and code also enables other researchers to establish the validity of the original analyses, can foster collaboration, and serves as a safeguard against data loss. When Spearman74 published his theory of general intelligence, he shared his data as an appendix to the article. A century later, this foresight allows scientists to use his data set for research and teaching. Because Spearman made his data public, other researchers have been able to assess the reproducibility and generality of his findings.

Data sharing has never been easier. Public repositories provide free storage for research materials and data (for example, the Open Science Framework) and for code (for example, GitHub). Although data sharing is not yet common practice in most scientific fields, several recent initiatives (for example, open data/code/materials badges75), standards (the TOP guidelines76), journals (for example, Scientific Data), and checklists (for example, the Transparency Checklist77) are helping to promote this research practice. When sharing raw data is not feasible, researchers can provide aggregated data summaries, for example the data underlying particular graphs or the covariance matrix of the relevant variables.
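When raw data cannot be released, an aggregated summary can often be shared instead. The sketch below (our illustration with hypothetical variable names) writes out the sample size, the per-variable means, and the covariance matrix of the analysis variables, which is enough input for many secondary analyses, and can be shared alongside the analysis code.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(74)
# Hypothetical raw data that cannot be shared directly
raw = pd.DataFrame(rng.normal(size=(300, 4)),
                   columns=["classics", "french", "english", "math"])

# Aggregated, shareable summaries: sample size, means, covariance matrix
n = len(raw)
means = raw.mean()
covariance = raw.cov()

means.to_csv("shared_means.csv")
covariance.to_csv("shared_covariance.csv")
print(f"n = {n}")
print(covariance.round(3))
```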

Restrictions imposed by funders, the ethical review boards of universities and other institutions, collaborators, and legal contracts may limit the extent to which data can be shared publicly. Practical considerations (for example, the sharing of very large data sets), data use agreements, privacy rights, and institutional policies may also constrain the intention to share. It nevertheless remains important to inform readers about the accessibility of the data under analysis. Note that such restrictions should not apply to the analysis code, provided that the code merely reflects the researchers' analytical operations and raises no data privacy issues.

An important principle for shared data is that they should be findable, accessible, interoperable, and reusable (FAIR78). Several guides discuss the practical79 and ethical80 aspects of data sharing. Researchers should follow the data-sharing procedures and requirements of their field81,82 and state the accessibility of the data in their research reports76,83. The ASA ethical guidelines for statistical practice33 state that the ethical statistician "[p]romotes sharing of data and methods as much as possible" and "[m]akes documentation suitable for replicate analyses, metadata studies, and other research by qualified investigators" (ref. 33, page 5).

If the statistical literature is any guide, one might conclude that statisticians rarely agree. For example, a 2019 special issue of The American Statistician contained 43 articles on P values. In their editorial, Wasserstein et al.13 noted that "the voices in the 43 papers in this issue do not sing as one". However, although disagreements about the foundations of statistical inference remain, we believe that statisticians nevertheless have much in common, particularly when it comes to their professional ethics. To examine this ethical dimension more systematically, we started from the Mertonian norms that express the ethos of science and outlined a non-exhaustive list of seven concrete, teachable, and implementable practices that we believe deserve to be promoted more widely.

In essence, these practices amount to increasing transparency and openly acknowledging uncertainty. Once the consensus on such practices is explicitly acknowledged, we believe that the hotly debated issues (for example, the P value) may become less important. Indeed, in a letter to his frequentist arch-rival Sir Ronald Fisher, the Bayesian Sir Harold Jeffreys wrote: "Your letter confirms my previous impression that it would only be once in a blue moon that we would disagree about the inference to be drawn in any particular case, and that in the exceptional cases we would both be a bit doubtful" (ref. 84, page 162).

We hope that the proposed statistical practices will improve the quality of data analysis across the board, particularly in applied disciplines that may be less familiar with the statistical ethos that statisticians may take for granted. Moreover, rather than expecting these ethical considerations and their statistical consequences to be absorbed by osmosis, we believe it is important to include them explicitly in the statistics curriculum. Beyond the statistical practices discussed here, other practices may also further the Mertonian ideals. We hope that this contribution stimulates a deeper exploration of how data analysis in applied fields can become more transparent, more informative, and more open to the uncertainty that inevitably accompanies any statistical analysis.

1. Good, I. J. 46656 varieties of Bayesians. Am. Stat. 25, 62–63 (1971).
2. Aczel, B. et al. Discussion points for Bayesian inference. Nat. Hum. Behav. 4, 561–566 (2020).
3. Bastiaansen, J. A. et al. Time to get personal? The impact of researchers' choices on the selection of treatment targets using the experience sampling methodology. J. Psychosom. Res. 137, 110211 (2020).
4. Botvinik-Nezer, R. et al. Variability in the analysis of a single neuroimaging dataset by many teams. Nature 582, 84–88 (2020).
5. van Dongen, N. et al. Multiple perspectives on inference for two simple statistical scenarios. Am. Stat. 73, 328–339 (2019).
6. Salganik, M. J. et al. Measuring the predictability of life outcomes with a scientific mass collaboration. Proc. Natl Acad. Sci. USA 117, 8398–8403 (2020).
7. Silberzahn, R. et al. Many analysts, one data set: making transparent how variations in analytic choices affect results. Adv. Methods Pract. Psychol. Sci. 1, 337–356 (2018).
8. Amrhein, V., Greenland, S. & McShane, B. B. Retire statistical significance. Nature 567, 305–307 (2019).
9. Benjamin, D. J. et al. Redefine statistical significance. Nat. Hum. Behav. 2, 6–10 (2018).
10. Harlow, L. L., Mulaik, S. A. & Steiger, J. H. (eds) What If There Were No Significance Tests? (Lawrence Erlbaum, 1997).
11. McShane, B. B., Gal, D., Gelman, A., Robert, C. & Tackett, J. L. Abandon statistical significance. Am. Stat. 73, 235–245 (2019).
12. Wasserstein, R. L. & Lazar, N. A. The ASA statement on p-values: context, process, and purpose. Am. Stat. 70, 129–133 (2016).
13. Wasserstein, R. L., Schirm, A. L. & Lazar, N. A. Moving to a world beyond "p < 0.05". Am. Stat. 73, 1–19 (2019).
14. Merton, R. K. in The Sociology of Science: Theoretical and Empirical Investigations 267–278 (Univ. of Chicago Press, 1973).
15. Tukey, J. W. Exploratory Data Analysis (Addison-Wesley, 1977).
16. Gelman, A. Exploratory data analysis for complex models. J. Comput. Graph. Stat. 13, 755–779 (2004).
17. Gabry, J., Simpson, D., Vehtari, A., Betancourt, M. & Gelman, A. Visualization in Bayesian workflow. J. R. Stat. Soc. A 182, 389–402 (2019).
18. Heathcote, A., Brown, S. D. & Wagenmakers, E.-J. in An Introduction to Model-Based Cognitive Neuroscience (eds Forstmann, B. & Wagenmakers, E.-J.) 25–48 (Springer, 2015).
19. Kerman, J., Gelman, A., Zheng, T. & Ding, Y. in Handbook of Data Visualization (eds Chen, C. et al.) 709–724 (Springer, 2008).
20. Weissgerber, T. L., Milic, N. M., Winham, S. J. & Garovic, V. D. Beyond bar and line graphs: time for a new data presentation paradigm. PLoS Biol. 13, e1002128 (2015).
21. Healy, K. & Moody, J. Data visualization in sociology. Annu. Rev. Sociol. 40, 105–128 (2014).
22. Gilbert, E. W. Pioneer maps of health and disease in England. Geogr. J. 124, 172–183 (1958).
23. Anscombe, F. J. Graphs in statistical analysis. Am. Stat. 27, 17–21 (1973).
24. Matejka, J. & Fitzmaurice, G. Same stats, different graphs: generating datasets with varied appearance and identical statistics through simulated annealing. In Proc. 2017 CHI Conference on Human Factors in Computing Systems 1290–1294 (2017).
25. Playfair, W. The Commercial and Political Atlas: Representing, by Copper-Plate Charts, the Progress of the Commerce, Revenues, Expenditure, and Debts of England During the Whole of the Eighteenth Century (1786).
26. Everitt, B. S., Landau, S., Leese, M. & Stahl, D. Cluster Analysis (John Wiley & Sons, 2011).
27. Chang, W., Cheng, J., Allaire, J., Xie, Y. & McPherson, J. shiny: web application framework for R. R package version 1.7.0, http://CRAN.R-project.org/package=shiny (2020).
28. iNZight Team. iNZight v.4.0.2, https://inzight.nz (2020).
29. Cairo, A. How Charts Lie: Getting Smarter about Visual Information (W. W. Norton & Company, 2019).
30. Gelman, A. Why tables are really much better than graphs. J. Comput. Graph. Stat. 20, 3–7 (2011).
31. Wainer, H. How to display data badly. Am. Stat. 38, 137–147 (1984).
32. Tufte, E. R. The Visual Display of Quantitative Information (Graphics Press, 1973).
33. Committee on Professional Ethics of the American Statistical Association. Ethical Guidelines for Statistical Practice, https://www.amstat.org/ASA/Your-Career/Ethical-Guidelines-for-Statistical-Practice.aspx (2018).
34. Diamond, L. & Lerch, F. J. Fading frames: data presentation and framing effects. Decis. Sci. 23, 1050–1071 (1992).
35. Chen, C., Härdle, W. & Unwin, A. (eds) Handbook of Data Visualization (Springer, 2008).
36. Cleveland, W. S. & McGill, R. Graphical perception: theory, experimentation, and application to the development of graphical methods. J. Am. Stat. Assoc. 79, 531–554 (1984).
37. Gelman, A., Pasarica, C. & Dodhia, R. Let's practice what we preach: turning tables into graphs. Am. Stat. 56, 121–130 (2002).
38. Mazza, R. Introduction to Information Visualization (Springer Science & Business Media, 2009).
39. Wilke, C. O. Fundamentals of Data Visualization: A Primer on Making Informative and Compelling Figures (O'Reilly Media, 2019).
40. Wilkinson, L. The Grammar of Graphics (Springer Science & Business Media, 1999).
41. Strack, F., Martin, L. L. & Stepper, S. Inhibiting and facilitating conditions of the human smile: a nonobtrusive test of the facial feedback hypothesis. J. Pers. Soc. Psychol. 54, 768–777 (1988).
42. Hoekstra, R., Finch, S., Kiers, H. A. & Johnson, A. Probability as certainty: dichotomous thinking and the misuse of p values. Psychon. Bull. Rev. 13, 1033–1037 (2006).
43. Cooper, R. J., Schriger, D. L. & Close, R. J. Graphic literacy: the quality of graphs in a large-circulation journal. Ann. Emerg. Med. 40, 317–322 (2002).
44. Schriger, D. L., Sinha, R., Schroter, S., Liu, P. Y. & Altman, D. G. From submission to publication: a retrospective review of the tables and figures in a cohort of randomized controlled trials submitted to the British Medical Journal. Ann. Emerg. Med. 48, 750–756 (2006).
45. International Committee of Medical Journal Editors. Recommendations for the Conduct, Reporting, Editing, and Publication of Scholarly Work in Medical Journals, http://www.icmje.org/icmje-recommendations.pdf (2019).
46. Steegen, S., Tuerlinckx, F., Gelman, A. & Vanpaemel, W. Increasing transparency through a multiverse analysis. Perspect. Psychol. Sci. 11, 702–712 (2016).
47. De Groot, A. D. The meaning of "significance" for different types of research [translated and annotated by Eric-Jan Wagenmakers, Denny Borsboom, Josine Verhagen, Rogier Kievit, Marjan Bakker, Angelique Cramer, Dora Matzke, Don Mellenbergh and Han L. J. van der Maas]. Acta Psychol. 148, 188–194 (2014).
48. Simmons, J. P., Nelson, L. D. & Simonsohn, U. False-positive psychology: undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychol. Sci. 22, 1359–1366 (2011).
49. Durante, K. M., Rae, A. & Griskevicius, V. The fluctuating female vote: politics, religion, and the ovulatory cycle. Psychol. Sci. 24, 1007–1016 (2013).
50. Leamer, E. E. Sensitivity analyses would help. Am. Econ. Rev. 75, 308–313 (1985).
51. Carp, J. On the plurality of (methodological) worlds: estimating the analytic flexibility of fMRI experiments. Front. Neurosci. 6, 149 (2012).
52. Poldrack, R. A. et al. Scanning the horizon: towards transparent and reproducible neuroimaging research. Nat. Rev. Neurosci. 18, 115–126 (2017).
53. Wessel, I., Albers, C., Zandstra, A. R. E. & Heininga, V. E. A multiverse analysis of early attempts to replicate memory suppression with the Think/No-think task. Memory 28, 870–887 (2020).
54. Simonsohn, U., Nelson, L. D. & Simmons, J. P. Specification curve analysis. Nat. Hum. Behav. 4, 1208–1214 (2020).
55. Patel, C. J., Burford, B. & Ioannidis, J. P. Assessment of vibration of effects due to model specification can demonstrate the instability of observational associations. J. Clin. Epidemiol. 68, 1046–1058 (2015).
56. Athey, S. & Imbens, G. W. Machine learning methods that economists should know about. Annu. Rev. Econ. 11, 685–725 (2019).
57. Levine, R. & Renelt, D. A sensitivity analysis of cross-country growth regressions. Am. Econ. Rev. 82, 942–963 (1992).
58. Del Giudice, M. & Gangestad, S. W. A traveler's guide to the multiverse: promises, pitfalls, and a framework for the evaluation of analytic decisions. Adv. Methods Pract. Psychol. Sci. 4, 1–15 (2021).
59. Hoeting, J. A., Madigan, D., Raftery, A. E. & Volinsky, C. T. Bayesian model averaging: a tutorial. Stat. Sci. 14, 382–401 (1999).
60. Dragicevic, P., Jansen, Y., Sarma, A., Kay, M. & Chevalier, F. Increasing the transparency of research papers with explorable multiverse analyses. In Proc. 2019 CHI Conference on Human Factors in Computing Systems 1–15 (2019).
61. Rawlinson, H., Talbot, F., Hincks, E. & Oppert, J. Inscription of Tiglath Pileser I., King of Assyria, B.C. 1150, as Translated by Sir Henry Rawlinson, Fox Talbot, Esq., Dr. Hincks, and Dr. Oppert (Published by the Royal Asiatic Society) (J. W. Parker and Son, 1857).
62. Boehm, U., Hawkins, G. E., Brown, S. D., van Rijn, H. & Wagenmakers, E.-J. Of monkeys and men: impatience in perceptual decision-making. Psychon. Bull. Rev. 23, 738–749 (2016).
63. Dutilh, G. et al. The quality of response time data inference: a blinded, collaborative assessment of the validity of cognitive models. Psychon. Bull. Rev. 26, 1051–1069 (2019).
64. Thangaratinam, S. & Redman, C. W. The Delphi technique. Obstet. Gynaecol. 7, 120–125 (2005).
65. Chadwick, J. Possible existence of a neutron. Nature 129, 312 (1932).
66. Tukey, J. W. The future of data analysis. Ann. Math. Stat. 33, 1–67 (1962).
67. Tell it like it is. Nat. Hum. Behav. 4, 1 (2020).
68. Simons, D. J., Shoda, Y. & Lindsay, D. S. Constraints on generality (COG): a proposed addition to all empirical papers. Perspect. Psychol. Sci. 12, 1123–1128 (2017).
69. Vinkers, C. H., Tijdink, J. K. & Otte, W. M. Use of positive and negative words in abstracts of scientific publications between 1974 and 2014: retrospective analysis. BMJ 351, h6467 (2015).
70. Bem, D. J. in The Compleat Academic: A Practical Guide for the Beginning Social Scientist (eds Zanna, M. R. & Darley, J. M.) 171–201 (Lawrence Erlbaum Associates, 1987).
71. van Doorn, J. et al. Strong public claims may not reflect researchers' private convictions. Significance 18, 44–45 (2021).
72. Yarkoni, T. No, it's not the incentives, it's you, https://www.talyarkoni.org/blog/2018/10/02/no-its-not-the-incentives-its-you/ (2018).
73. Hoekstra, R. & Vazire, S. Aspiring to greater intellectual humility in science. Preprint at https://doi.org/10.31234/osf.io/edh2s (2020).
74. Spearman, C. "General intelligence," objectively determined and measured. Am. J. Psychol. 15, 201–293 (1904).
75. Kidwell, M. C. et al. Badges to acknowledge open practices: a simple, low-cost, effective method for increasing transparency. PLoS Biol. 14, e1002456 (2016).
76. Nosek, B. et al. Promoting an open research culture. Science 348, 1422–1425 (2015).
77. Aczel, B. et al. A consensus-based transparency checklist. Nat. Hum. Behav. 4, 4–6 (2020).
78. Wilkinson, M. D. et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci. Data 3, 160018 (2016).
79. Klein, O. et al. A practical guide for transparency in psychological science. Collabra Psychol. 4, 20 (2018).
80. Alter, G. & Gonzalez, R. Responsible practices for data sharing. Am. Psychol. 73, 146–156 (2018).
81. Wagenmakers, E.-J., Kucharský, Š. & The JASP Team (eds) The JASP Data Library (JASP Publishing, 2020).
82. Taichman, D. B. et al. Data sharing statements for clinical trials: a requirement of the International Committee of Medical Journal Editors. JAMA 317, 2491–2492 (2017).
83. Aalbersberg, I. J. et al. Making science transparent by default: introducing the TOP Statement. Preprint at https://osf.io/sm78t (2018).
84. Bennett, J. H. (ed.) Statistical Inference and Analysis: Selected Correspondence of R. A. Fisher (Clarendon Press, 1990).
85. Anderson, M. S., Martinson, B. C. & De Vries, R. Normative dissonance in science: results from a national survey of U.S. scientists. J. Empir. Res. Hum. Res. Ethics 2, 3–14 (2007).

We thank N. Lazar for comments on a draft. We thank everyone who participated in drafting the initial list of statistical procedures during the hackathon at the 2019 meeting of the Society for the Improvement of Psychological Science in Rotterdam, the Netherlands. This work was supported in part by a European Research Council (ERC) grant to E.-J.W. (no. 283876), a Netherlands Organisation for Scientific Research (NWO) grant to A. Sarafoglou (no. 406-17-568), and an NWO Vidi grant to D.v.R. (no. 016.Vidi.188.001).

Department of Psychology, University of Amsterdam, The Netherlands

Eric-Jan Wagenmakers, Alexandra Sarafoglou & Noah van Dongen

School of Public Health and Primary Care, Maastricht University, Maastricht, The Netherlands

Heymans Institute of Psychology, University of Groningen, The Netherlands

Donders Institute for Brain, Cognition and Behavior, Radboud University, Nijmegen, The Netherlands

School of Business Administration, Prague University of Economics, Czech Republic

Štěpán Bahník

Department of Educational Science, University of Groningen, Groningen, The Netherlands

School of Psychology and Brain Research Centre, University of Auckland, Auckland, New Zealand

Department of Psychology, University of Groningen, Groningen, The Netherlands

Don van Ravenzwaaij & Jorge Tendeiro

Rotterdam School of Management, Erasmus University Rotterdam, Rotterdam, Netherlands

Department of Psychology, University of Münster, Germany

Artificial Intelligence and Data Innovation Research and Academia-Government-Community Collaborative Education and Research Center Office, Hiroshima University, Hiroshima, Japan

Institute of Psychology, ELTE Eotvos Lorand University, Budapest, Hungary


Conceptualization: E.-J.W., A. Sarafoglou and B.A. Project administration: B.A. Writing (original draft): E.-J.W., A. Sarafoglou, C.A., J.A., Š.B., N.v.D., R.H., D.M., D.v.R., A. Sluga, J.T. and B.A. Writing (review and editing): E.-J.W., A. Sarafoglou, S.A., C.A., J.A., Š.B., N.v.D., R.H., D.M., D.v.R., A. Sluga, F.S., J.T. and B.A.

The authors declare no competing interests.

Peer review information: Nature Human Behaviour thanks David Hand and Ulf Toelch for their contribution to the peer review of this work.

Publisher's note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Wagenmakers, E.-J., Sarafoglou, A., Aarts, S. et al. Seven steps toward more transparency in statistical practice. Nat. Hum. Behav. 5, 1473–1480 (2021). https://doi.org/10.1038/s41562-021-01211-8

DOI: https://doi.org/10.1038/s41562-021-01211-8

